Abstract



Periodicity in words is one of the most fundamental areas of text algorithms and combinatorics. Two classical and natural variations of periodicity are seeds and covers (also called quasiperiods). Linear-time algorithms are known for finding all the covers of a word; however, in the case of seeds, for the past 15 years only an $O(n \log n)$ time algorithm was known (Iliopoulos, Moore and Park, 1996). Finding an $o(n \log n)$ time algorithm for the all-seeds problem was mentioned as one of the most important open problems related to repetitions in words in a survey by Smyth (2000). We show a linear-time algorithm computing all the seeds of a word, in particular, the shortest seed. Our approach is based on the use of a version of LZ-factorization and non-trivial combinatorial relations between the LZ-factorization and seeds. It is used here for the first time in the context of seeds. It saves the work done for factors processed earlier, similarly as in Crochemore's square-free testing.

1 Introduction

The notion of periodicity in words is widely used in many fields, such as combinatorics on words, pattern matching, data compression, automata theory, formal language theory, molecular biology etc. (see [26]). The concept of quasiperiodicity is a generalization of the notion of periodicity, and was defined by Apostolico & Ehrenfeucht in [1]. A quasiperiodic word is entirely covered by occurrences of another (shorter) word, called the quasiperiod or the cover. The occurrences of the quasiperiod may overlap, while in a periodic repetition the occurrences of the period do not overlap. Hence, quasiperiodicity enables detecting repetitive structure of words when it cannot be found using the classical characterizations of periods. An extension of the notion of a cover is the notion of a seed: a cover which is not necessarily aligned with the ends of the word being covered, but is allowed to overflow on either side. Seeds were first introduced and studied by Iliopoulos, Moore and Park [20].

Covers and seeds have potential applications in DNA sequence analysis, namely in the search for regularities and common features in DNA sequences. The importance of the notions of quasiperiodicity follows also from their relation to text compression. Due to natural applications in molecular biology (a hybridization approach to analysis of a DNA sequence), both covers and seeds have also been extended in the sense that a set of factors is considered instead of a single word [22]. This way the notions of $k$-covers [8, 19], $\lambda$-covers [17] and $\lambda$-seeds [16] were introduced. In applications such as molecular biology and computer-assisted music analysis, finding exact repetitions is not always sufficient, and the same holds for quasiperiodic repetitions. This led to the introduction of the notions of approximate covers [29] and approximate seeds [6].

1.1 Previous results 

Iliopoulos, Moore and Park [20] gave an $O(n \log n)$ time algorithm computing all the seeds of a given word $w \in \Sigma^n$. For the next 15 years, no $o(n \log n)$ time algorithm was known for this problem. Computing all the seeds of a word in linear time was also set as an open problem in the survey [30]. A parallel algorithm computing all the seeds in $O(\log n)$ time and $O(n^{1+\varepsilon})$ space using $n$ processors in the CRCW PRAM model was given by Berkman et al. [3]. An alternative sequential $O(n \log n)$ algorithm for computing the shortest seed was recently given by Christou et al. [7].

In contrast, a linear time algorithm finding the shortest cover of a word was given by Apostolico et al. [2] and later on improved into an on-line algorithm by Breslauer [4]. A linear time algorithm computing all the covers of a word was proposed by Moore & Smyth [28]. Afterwards an on-line algorithm for the all-covers problem was given by Li & Smyth [25].

Another line of research is finding maximal quasiperiodic subwords of a word. This notion resembles the maximal repetitions (runs) in a word [23], which is another widely studied notion of combinatorics on words. $O(n \log n)$ time algorithms for reporting all maximal quasiperiodic subwords of a word of length $n$ have been proposed by Brodal & Pedersen [5] and Iliopoulos & Mouchard [21]. These results improved upon the initial $O(n \log^2 n)$ time algorithm by Apostolico & Ehrenfeucht [1].

1.2 Our results 

We present a linear time algorithm computing a linear representation of all the seeds of a word. Our algorithm runs in linear time due to combinatorics on words: the connections between seeds and factorizations stated in the following Reduction-Property Lemma and Work-Skipping Lemma; and algorithmics: linear time implementation of merging smaller results into larger, due to efficient processing of subtrees of a suffix tree (the function ExtractSubtrees), and efficient computation of long candidates for seeds (the function ComputeInRange).

Figure 1: The word abaa is the shortest seed of the word aaabaabaabaaabaaba.

An important tool used in the paper is the $f$-factorization, which is a variant of the Lempel-Ziv factorization [31]. It plays an important role in optimization of text algorithms (see [9, 24, 27]). Intuitively, we can save some work by reusing the results computed for the previous occurrences of factors. We also apply this technique here. The $f$-factorization can be computed in linear time [11, 13], using the so-called Longest Previous non-overlapping Factor (LPnF) table.

Another important text processing tool is the suffix tree [10, 12]. The suffix tree of a word is a compact TRIE of all the suffixes of the word. The suffix tree can be constructed in $O(n)$ time for an integer alphabet $\Sigma$ (directly [14] or from the suffix array [10, 12]).

In the algorithm we assume that the alphabet $\Sigma$ consists of integers, and its size is polynomial in $n$. Hence, the letters of $w$ can be sorted in linear time.

1.3 Preliminaries 

We consider words over $\Sigma$, $u \in \Sigma^*$; the positions in $u$ are numbered from $1$ to $|u|$. By $\Sigma^n$ we denote the set of words of length $n$. For $u = u_1 u_2 \ldots u_n$, let us denote by $u[i..j]$ a factor of $u$ equal to $u_i \ldots u_j$.

We say that a word $v$ is a cover of $w$ (covers $w$) if every letter of $w$ is within some occurrence of $v$ as a subword of $w$. A word $v$ is a seed of $w$ if it is a subword of $w$ and $w$ is a subword of some word $u$ covered by $v$, see Fig. 1. The following lemma provides a useful property of covers which we will use also for seeds.



Lemma 1. If $v$ is a cover of $w$ then there exists a sequence of at most $\left\lfloor \frac{2|w|}{|v|} \right\rfloor$ occurrences of $v$ which completely covers $w$.

Proof. The proof goes by induction over $|w|$. If $|w| < 2|v|$ then the conclusion holds, since $v$ is both a prefix and a suffix of $w$. Otherwise, let $i$ be the starting position of the last occurrence of $v$ in $w$ such that $i \le |v|$. Now let $j$ be the first position of the next occurrence of $v$ in $w$. Note that both positions $i$, $j$ exist and that $j > |v|$.

The word $v$ is a cover of $w[j..|w|]$. By the inductive hypothesis, $w[j..|w|]$ can be covered by at most

$$\left\lfloor \frac{2(|w|-j+1)}{|v|} \right\rfloor \le \left\lfloor \frac{2|w|}{|v|} \right\rfloor - 2$$

occurrences of $v$. Hence, $w$ can be covered by all these occurrences together with those starting at $1$ and at $i$. This concludes the inductive proof. $\square$
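The definitions of covers and seeds, and the bound of Lemma 1, can be checked by brute force on small words. The following Python sketch is only an illustration under stated assumptions: the helper names (`occurrences`, `is_cover`, `is_seed`, `select_covering_occurrences`) are hypothetical, positions are 0-based, and the greedy selection is a simple stand-in for the inductive argument of the proof rather than the procedure used in the paper.

```python
def occurrences(v, w):
    """0-based start positions of all (possibly overlapping) occurrences of v in w."""
    return [i for i in range(len(w) - len(v) + 1) if w.startswith(v, i)]


def is_cover(v, w):
    """True if every letter of w lies inside some occurrence of v in w."""
    if not v:
        return False
    covered_to = 0  # w[:covered_to] is covered so far
    for i in occurrences(v, w):
        if i > covered_to:  # some letter before position i stays uncovered
            return False
        covered_to = i + len(v)
    return covered_to == len(w)


def is_seed(v, w):
    """True if v is a subword of w and w is a subword of a word covered by v.

    It suffices to try extensions x + w + z where x is a proper prefix of v
    and z is a proper suffix of v (the overflow on either side).
    """
    if not v or v not in w:
        return False
    k = len(v)
    for a in range(k):      # length of the left overflow
        for b in range(k):  # length of the right overflow
            if is_cover(v, v[:a] + w + v[k - b:]):
                return True
    return False


def select_covering_occurrences(v, w):
    """Greedily pick occurrences of a cover v that together cover w."""
    if not is_cover(v, w):
        raise ValueError("v is not a cover of w")
    occ = occurrences(v, w)
    chosen, covered_to, k = [], 0, 0
    while covered_to < len(w):
        # advance to the rightmost occurrence starting at or before covered_to
        while k + 1 < len(occ) and occ[k + 1] <= covered_to:
            k += 1
        chosen.append(occ[k])
        covered_to = occ[k] + len(v)
    return chosen
```

On the word from Fig. 1, `is_seed("abaa", "aaabaabaabaaabaaba")` holds while no shorter subword passes the test, and the number of occurrences returned by `select_covering_occurrences` stays within the bound of Lemma 1.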

2 The tools: factorizations and staircases 

From now on, let us fix a word $w \in \Sigma^n$. The intervals $[i..j] = \{i, i+1, \ldots, j\}$ will be denoted by small Greek letters. Assume all intervals considered satisfy $1 \le i \le j \le n$. We denote by $|\gamma|$ the length of the interval $\gamma$.

By a factorization of a word $w$ we mean a sequence $F = (f_1, f_2, \ldots, f_K)$ of factors of $w$ such that $w = f_1 f_2 \cdots f_K$ and for each $i = 1, 2, \ldots, K$, $f_i$ is a subword of $f_1 \ldots f_{i-1}$ or a single letter. Denote $|F| = K$. A factorization of $w$ is called an $f$-factorization [10] if $F$ has the minimal number of factors among all the factorizations of $w$. An $f$-factorization is constructed in a greedy way, as follows. Let $1 \le i \le K$ and $j = |f_1 f_2 \cdots f_{i-1}| + 1$. If $w[j]$ occurs in $f_1 f_2 \ldots f_{i-1}$, then $f_i$ is the longest prefix of $w[j..|w|]$ that is a subword of $w[1..j-1]$. Otherwise $f_i = w[j]$.
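The greedy construction can be written down directly. The sketch below is a quadratic-time (or worse) illustration of the definition only, not the linear-time LPnF-based computation referred to in the paper; the function name `f_factorization` is an assumption of this example.

```python
def f_factorization(w):
    """Greedy f-factorization of w, returned as a list of factors.

    Each factor is the longest prefix of the remaining suffix that occurs
    entirely inside the already factorized prefix, or a single letter if
    no such non-empty prefix exists.
    """
    factors = []
    j = 0  # 0-based start of the next factor
    n = len(w)
    while j < n:
        length = 0
        # str.find(sub, 0, j) only reports occurrences lying fully in w[:j],
        # so the previous occurrence never overlaps the new factor
        while j + length < n and w.find(w[j:j + length + 1], 0, j) != -1:
            length += 1
        length = max(length, 1)  # a fresh letter forms a factor on its own
        factors.append(w[j:j + length])
        j += length
    return factors
```

For instance, `f_factorization("abababab")` yields the factors a, b, ab, abab: the last factor is allowed to be long because abab already occurs in the prefix abab.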

There is a useful relation between covers and $f$-factorization, which then extends to seeds (see the Quasiseeds-Factors Lemma in the next section). For a factorization $F$ and interval $\lambda$, denote by $F_\lambda$ the set of factors of $F$ which start and end within $\lambda$.

Lemma 2. Let $F$ be an $f$-factorization of $w$ and $v$ be a cover of $w$. Then for $\lambda = [|v|+1 .. |w|]$ we have:

$$|F_\lambda| \le \left\lfloor \frac{2|w|}{|v|} \right\rfloor.$$

Proof. We assume that $|w| > |v|$, otherwise the conclusion is trivial. By Lemma 1 the word $w$ can be completely covered with some occurrences of $v$ at the positions $i_1, i_2, \ldots, i_p$, $1 = i_1 < i_2 < \cdots < i_p$, where $p \le \left\lfloor \frac{2|w|}{|v|} \right\rfloor$. Additionally define $i_{p+1} = |w| + 1$. Then there exists a factorization $F'$ of $w$ such that:

$$F'_\lambda = \{w[|v|+1 .. i_3-1]\} \cup \{w[i_j .. i_{j+1}-1] : j = 3, 4, \ldots, p\}.$$

This set forms a factorization of $w[|v|+1 .. |w|]$ and consists of $p-1$ elements. This concludes that for any $f$-factorization of $w$ the set $F_\lambda$ consists of at most $p-1$ elements, since otherwise we could substitute all the elements from this set by $F'_\lambda$, shorten the rightmost element of $F \setminus F_\lambda$ to the position $|v|$ and thus transform $F$ into a factorization of $w$ with a smaller number of factors, which is not possible. $\square$

Another useful concept is that of a staircase of intervals. An interval staircase covering $w$ is a sequence of intervals of length $3m$ with overlaps of size $2m$, starting at the beginning of $w$; possibly the last interval is shorter. More formally, an $m$-staircase covering $w$ for a given $1 \le m \le n$, is a set of intervals covering $w$, defined as follows:

$$S = \left\{ [k \cdot m + 1 \,..\, (k+3) \cdot m - 1] \cap [1..n] \;:\; k = 0, 1, \ldots, \max\left(0, \left\lceil \tfrac{n}{m} \right\rceil - 3\right) \right\} \setminus \{\emptyset\}.$$

A first informal description of the algorithm: The main algorithm finding the seeds works recursively. The computation for the whole word $w$ is reduced to the computation for several subwords of $w$. One could try to use the subwords corresponding to an interval staircase $S$, however this would not bring any gain in the time complexity, since the total length of the intervals from $S$ could be about $3n$. Instead, in the algorithm a reduced staircase $\mathcal{K}$ is used for determining the recursive calls (the definition of this notion follows). By the following Reduction-Property Lemma, for a smart choice of the parameter $m$ the total length of the intervals in $\mathcal{K}$ is at most $\frac{1}{2}n$, which is very important in the analysis of the time complexity of the whole algorithm.

Let $F = (f_1, f_2, \ldots, f_K)$ be an $f$-factorization of $w$. If $S$ is an $m$-staircase covering $w$, let $\mathcal{K}$ be the family of those intervals $\lambda = [i..j] \in S$ such that $w[i..j+m]$ (an extended interval) does not lie within a single factor of $F$, that is, $w[i..j+m]$ overlaps more than one factor of $F$. Then we say that $\mathcal{K}$ is obtained by a reduction of $S$ with regard to the $f$-factorization $F$ and denote this by $\mathcal{K} = \mathit{Reduce}(S,F)$.
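The reduction step can be sketched as a filter over a list of intervals. In the Python sketch below the function name `reduce_staircase`, the representation of intervals as 1-based inclusive pairs and of the factorization as a list of factor lengths are all assumptions of this example; the staircase itself is supplied by the caller, and extended intervals are clipped to the end of the word.

```python
from bisect import bisect_left
from itertools import accumulate


def reduce_staircase(intervals, factor_lengths, m):
    """Keep the intervals whose extended interval overlaps more than one factor.

    intervals:      list of (i, j), 1-based inclusive positions in w
    factor_lengths: lengths of the consecutive factors of w
    m:              the staircase parameter (extension to the right)
    """
    ends = list(accumulate(factor_lengths))  # 1-based end position of each factor
    n = ends[-1]

    def factor_index(pos):
        # index of the factor containing the 1-based position pos
        return bisect_left(ends, pos)

    kept = []
    for i, j in intervals:
        extended_end = min(j + m, n)  # clip the extended interval to w
        if factor_index(i) != factor_index(extended_end):
            kept.append((i, j))
    return kept
```

With factor lengths 1, 1, 2, 4 (the factorization a, b, ab, abab of abababab) and $m = 1$, the intervals [5..7] and [6..8] are dropped because their extended intervals lie entirely inside the last factor.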

There is a simple and useful relation between this reduction and the number of factors in a 
given factorization. 

Lemma 3. [Staircase-Factors Lemma]

Assume we have an $m$-staircase $S$ covering $w$ and any factorization $F$ of $w$. Then

$$|\mathit{Reduce}(S,F)| \le 4(|F| - 1).$$

Figure 2: An $m$-staircase $S$ of intervals covering $w$ and an $f$-factorization $F = (f_1, f_2, \ldots, f_9)$ of $w$. The shaded intervals form the family $\mathcal{K} = \mathit{Reduce}(S,F)$.



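Proof. An interval belongs to $\mathit{Reduce}(S,F)$ only if its extended interval contains one of the $|F| - 1$ inter-positions separating consecutive factors of $F$. Each inter-position in $w$ is covered by at most 4 extended intervals in a staircase, which yields the bound. $\square$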



Define the following property of the constants $c_1, c_2, n_0 > 0$:

Reduction-Property:

Let $\Delta = \lfloor c_1 n \rfloor$, $n > n_0$. Let $F$ be an $f$-factorization of $w$, and let $g = |F_{[2\Delta+1..n-\Delta]}|$; the factors from the latter set are called the middle factors of $F$. Let $m = \left\lfloor \frac{c_2 n}{g} \right\rfloor$, or $m = 0$ if $g = 0$. If $m > 0$ then for each $m$-staircase $S$:

$$g \ge 3 \;\wedge\; \|\mathit{Reduce}(S,F)\| \le \tfrac{1}{2} n.$$

Here $\|\mathcal{J}\|$ denotes the total length of the intervals in a family $\mathcal{J}$. The intervals from $\mathit{Reduce}(S,F)$ are called the working intervals. Informally speaking, the reduction-property means that for suitable positive constants $c_1, c_2, n_0$ the total size of working intervals does not exceed $\frac{n}{2}$.

Lemma 4. [Reduction-Property Lemma]

The reduction-property holds for the constants $c_1 = c_2 = \frac{1}{50}$, $n_0 = 200$.



Proof. Let us start with showing that $g \ge 3$, that is, that any $f$-factorization $F$ has at least three middle factors. First of all, we have $\Delta(n) \ge c_1 n_0 > 0$.

Note that if a factor $f \in F$ starts at the position $i$ in $w$, then $|f| \le i$. Hence, the first factor in $F_{[2\Delta+1..n-\Delta]}$, if it exists, starts at a position not exceeding $4\Delta$, and thus has length at most $4\Delta$. Similarly, the second middle factor has length at most $8\Delta$ and the third has length at most $16\Delta$. In total, the occurrence of the third considered factor ends at a position not exceeding $32\Delta$, and $32\Delta \le n - \Delta$ by the choice of the parameter $c_1$. This concludes that there must be at least three middle factors.

Now we proceed to the proof of the fact that $\|\mathcal{K}\| \le \frac{1}{2}n$, where $\mathcal{K} = \mathit{Reduce}(S,F)$. For this, we divide the intervals from $\mathcal{K}$ into three groups. First, let us consider intervals from $\mathcal{K}$ that start no later than at position $2\Delta$. Clearly, in $S$ there are at most $\frac{2\Delta}{m}$ such intervals (recall that $m > 0$), therefore there are at most $\frac{2\Delta}{m}$ such intervals in the reduced staircase $\mathcal{K}$. Each of the intervals has length at most $3m$, hence their total length does not exceed $6\Delta$.

Figure 3: A word $w$ of length $n$ with an $f$-factorization $F = (f_1, f_2, \ldots, f_{12})$. Here the middle factors of $F$ are $f_7, f_8, f_9, f_{10}$, so $g = 4$. The segments above illustrate the family $\mathcal{K} = \mathit{Reduce}(S,F)$. By the Reduction-Property Lemma, if these segments are short enough and $\Delta$ is small enough, then the total length of these intervals, $\|\mathcal{K}\|$, does not exceed $\frac{n}{2}$.

Now consider the intervals from $\mathcal{K}$ which end at some position $\ge n - \Delta + 1$. Each of the intervals starts at a position not smaller than:

$$n - \Delta + 1 - 3m + 1 \;\ge\; n - \Delta(n) + 2 - \frac{3\Delta}{g} \;\ge\; n - 2\Delta + 2.$$

In the second inequality we used the fact that $g \ge 3$. Hence, all the considered intervals from $\mathcal{K}$ are subintervals of an interval of length $2\Delta$. Exactly in the same way as in the previous case we obtain that their total length does not exceed $6\Delta$.

Finally, we consider the intervals from $\mathcal{K}$ which are subintervals of the interval

$$\alpha = [2\Delta+1 .. n-\Delta].$$

Let $S'$ be the set of intervals obtained from $S$ by extending each interval by $m$ positions to the right. Let $f_i, f_{i+1}, \ldots, f_j$ be all the factors of $F$ with a non-empty intersection with the interval $\alpha$. Note that $j + 1 - i \le g + 2$. Then, due to the Staircase-Factors Lemma, we have:

$$|\{\lambda \in \mathcal{K} : \lambda \subseteq \alpha\}| = |\mathit{Reduce}(S, (f_i, \ldots, f_j))| \le 4(g+1).$$

The total length of such intervals does not exceed:



$$4 \cdot (g+1) \cdot 3m \le 13 \cdot c_2 n.$$

Here, again, we used the fact that $g \ge 3$.
In conclusion, we have:

$$\|\mathcal{K}\| \le 2 \cdot 6 \cdot c_1 n + 13 \cdot c_2 n \le \tfrac{1}{2} n. \qquad \square$$

From now on we fix the values of the constants $c_1, c_2, n_0$ as in Lemma 4.



3 Quasiseeds 

Assume $v$ is a subword of $w$. If $w$ can be decomposed into $w = xyz$, where $|x|, |z| < |v|$ and $v$ is a cover of $y$, then we say that $v$ is a quasiseed of $w$. On the other hand, if $w$ can be decomposed into $w = xyz$, so that $|x|, |z| < |v|$ and $v$ is a seed of $xvz$, then $v$ is a border seed of $w$. We have the following simple observation.

Observation 1. Let v be a subword of w. The word v is a seed of w if and only if v is a quasiseed 
of w and v is a border seed of w. 



A quasiseed is a weaker and computationally easier version of the seed. The main part of the algorithm computes representations of all the quasiseeds. Then from this information the seeds can be computed in linear time, applying the border seed condition, by already known methods [7, 20]. We describe this step in Subsection 3.2.

As a corollary of Lemma 2 we obtain a similar fact for quasiseeds (hence, also for seeds).

Lemma 5. [Quasiseeds-Factors Lemma]

Let $F$ be an $f$-factorization of $w$ and $v$ be a quasiseed of $w$. Then for $\alpha = [2|v|+1 .. |w|-|v|]$ we have:

$$|F_\alpha| \le \left\lfloor \frac{2|w|}{|v|} \right\rfloor.$$

A direct consequence of this lemma is one of the key facts used to save the work of the algorithm. Due to this fact we can skip a large part of the computations.

Lemma 6. [Work-Skipping Lemma]

Let $c_1, c_2, n_0$ be as in Lemma 4. Let $g$ be the number of middle factors in an $f$-factorization $F$ of the word $w$, $w \in \Sigma^n$ for $n > n_0$. Then there is no quasiseed $v$ of $w$ such that

$$\frac{2n}{g} < |v| \le c_1 n. \qquad (1)$$

Proof. Assume to the contrary that there exists a quasiseed $v$ which satisfies the conditions (1). By the Quasiseeds-Factors Lemma we obtain that $|F_\alpha| \cdot |v| \le 2|w|$ where $\alpha = [2|v|+1 .. |w|-|v|]$. Due to the condition $|v| \le c_1 n$, and hence $|v| \le \Delta$, we have $|F_\alpha| \ge |F_{[2\Delta+1..n-\Delta]}| = g$. We conclude that $g \cdot |v| \le 2|w|$, which contradicts the first inequality from (1). $\square$

3.1 Quasigaps 

The suffix tree of the word $w$, denoted as $T$, is a compact TRIE of all suffixes of the word $w\#$, for $\# \notin \Sigma$ being a special end-marker. Recall that a suffix tree can be constructed in $O(n)$ time [10, 12, 14]. For simplicity, we identify the nodes of $T$ (both explicit and implicit) with the subwords of $w$ which they represent. Leaves of $T$ correspond to suffixes of $w$; the leaf corresponding to $w[i..n]$ is annotated with $i$.

Let us introduce an equivalence relation on subwords of w. We say that two words are equivalent 
if the sets of start positions of their occurrences as subwords of $w$ are equal. Note that this relation is very closely related to the suffix tree. Namely, each equivalence class corresponds to the set of
implicit nodes lying on the same edge together with the explicit deeper end of that edge. Quasiseeds 
belonging to the same equivalence class turn out to have a regular structure. To describe it, let us 
introduce the notion of quasigap, a key one for our algorithm. 

Definition 1. Let $v$ be a subword of $w$. If $v$ is a quasiseed of $w$, then by $\mathit{quasigap}(v)$ we denote the length of the shortest quasiseed of $w$ equivalent to $v$. Otherwise we define $\mathit{quasigap}(v)$ as $\infty$.

The following observation provides a characterization of all quasiseeds of a given word using only 
the quasigaps of explicit nodes in the suffix tree T. 

Observation 2. All quasiseeds in a given equivalence class are exactly the prefixes of the longest word $v$ in this class (corresponding to an explicit node in the suffix tree) of length at least $\mathit{quasigap}(v)$.

Proof. Let $u$ and $u'$ be equivalent to $v$ and such that $|u| < |u'|$. It suffices to show that if $u$ is a quasiseed, then so is $u'$. Indeed, since the sets of occurrences of both words are equal, this is a clear consequence of the definition of a quasiseed. $\square$




Example 1. Consider the words $w$ = aaaaaabaaabaaabaaaa, $v$ = aaabaaa. The equivalence class of $v$ is $E$ = {aaabaaa, aaabaa, aaaba, aaab}. In $E$ the quasiseeds are aaabaaa, aaabaa, aaaba. Among these quasiseeds only $v$ is a border seed of $w$, hence a seed of $w$ (in this case the only one). All quasiseeds in $E$ are the prefixes of $v$ of length at least 5, hence $\mathit{quasigap}(v) = 5$.
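Example 1 can be reproduced by brute force. The Python sketch below follows the definition of a quasiseed and Definition 1 literally; the helper names (`starts`, `is_quasiseed`, `quasigap`) are assumptions of this example, and the quadratic search has nothing to do with the suffix-tree machinery of the paper. It relies on the observation that, if any decomposition $w = xyz$ works, then so does the one in which $y$ stretches from the first occurrence of $v$ to the end of the last one.

```python
import math


def starts(v, w):
    """1-based start positions of the occurrences of v in w."""
    return [i + 1 for i in range(len(w) - len(v) + 1) if w.startswith(v, i)]


def is_quasiseed(v, w):
    """True if w = x y z with |x|, |z| < |v| and v a cover of y."""
    occ = starts(v, w)
    if not v or not occ:
        return False
    k, n = len(v), len(w)
    gaps_ok = all(b - a <= k for a, b in zip(occ, occ[1:]))  # no hole inside y
    left_ok = occ[0] - 1 < k                                 # |x| < |v|
    right_ok = n - (occ[-1] + k - 1) < k                     # |z| < |v|
    return gaps_ok and left_ok and right_ok


def quasigap(v, w):
    """Length of the shortest quasiseed with the same occurrence set as v."""
    if not is_quasiseed(v, w):
        return math.inf
    occ = starts(v, w)
    p = occ[0] - 1  # 0-based start of the first occurrence
    for k in range(1, len(v) + 1):
        u = w[p:p + k]  # equivalent words are prefixes of one another
        if starts(u, w) == occ and is_quasiseed(u, w):
            return k
    return len(v)
```

For the words of Example 1 this reports the occurrences 4, 8, 12 of aaabaaa, rejects aaab, accepts aaaba, aaabaa and aaabaaa, and returns the quasigap 5.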

3.2 From quasigaps to seeds 

In this section we show how to reduce finding all seeds of $w$ to computing quasigaps. For this, due to Observation 1, we need to identify all border seeds among quasiseeds of $w$. By Observation 2, the set of quasiseeds consists of a family of ranges $\{v[1..k] : \mathit{quasigap}(v) \le k \le |v|\}$ on an edge $(u, v)$ of $T$. We call such a range a candidate set.

The approach presented in [20] (section "Finding hard seeds") can now be used. This algorithm takes as its input all candidate sets of $w$ and of $w^R$, where $w^R$ is the reversed $w$. It returns all the seeds of $w$ represented as ranges on the edges of suffix trees of both $w$ and $w^R$. Alternatively, one can exploit an algorithm presented in [7], which, given a family of candidate sets of $w$, finds a shortest border seed of $w$ belonging to this family, hence a shortest seed.

Both solutions run in linear time. Therefore, we have reduced the problem of finding seeds to computing quasigaps of explicit nodes in $T$. Hence, as soon as we show that this can be done in $O(n)$ time (Theorem 2), we obtain the main result of the paper.

Theorem 1. An $O(n)$-size representation of all the seeds of $w$ can be found in $O(n)$ time. In particular, a shortest seed can be computed within the same time complexity.

4 Generalization to arbitrary intervals 

The algorithm computing quasigaps (which are a representation of quasiseeds) has a recursive 
structure. Unfortunately, the relation between quasiseeds in subwords of w and quasiseeds of entire 
w is quite subtle and we have to deal with technical representations of quasiseeds on the suffix tree. 
Even worse, the suffix trees of subwords of w are not exactly subtrees of the suffix tree of w. Due 
to these issues, our algorithm operates on subintervals of [1 . . n] rather than subwords of w. 

For an interval $\gamma = [i..j]$, we can introduce the notions of an $f$-factorization, an interval staircase covering $\gamma$ and a reduced staircase in a natural way exactly as the corresponding notions for the subword $w[i..j]$.

4.1 Induced suffix trees

Let us define an induced suffix tree $T(\gamma)$ for $\gamma = [i..j]$. It is obtained from $T$ as follows. First, leaves labelled with numbers from $\gamma$ and all their ancestors are selected, and all other nodes are removed. Then, the resulting tree is compacted, that is, all non-branching nodes become implicit. Still, the nodes (implicit and explicit) of such a tree can be identified with subwords of $w$. Of course, $T(\gamma)$ is not a suffix tree of $w[i..j]$.

By $\mathit{Nodes}(T(\gamma))$ we denote the set of explicit nodes of $T(\gamma)$. Let $v$ be a (possibly implicit) node of $T(\gamma)$. By $\mathit{parent}(v,\gamma)$ we denote the closest explicit ancestor of $v$. We assume that the root is the only node that is an ancestor of itself. Additionally, for an implicit node $v$ of $T(\gamma)$, we define $\mathit{desc}(v,\gamma)$ as the closest explicit descendant of $v$. If $v$ is explicit we assume $\mathit{desc}(v,\gamma) = v$.



4.2 Quasigaps in an arbitrary interval 

Here we extend the notion of quasigaps to a given interval $\gamma$. We provide definitions which are technically more complicated, however computationally more useful.

Before that we need to develop some notation. Let $v$ be an arbitrary subword of $w$, or equivalently a node of $T$. By $\mathit{Occ}(v,\gamma)$ we denote the set of those starting positions of occurrences of $v$ that lie in the interval $\gamma$. Note that the set $\mathit{Occ}(v,\gamma)$ represents the set of all leaves of $T(\gamma)$ in the subtree rooted at $v$. Let $\mathit{first}(v,\gamma) = \min \mathit{Occ}(v,\gamma)$ and $\mathit{last}(v,\gamma) = \max \mathit{Occ}(v,\gamma)$. Here we assume that $\min \emptyset = +\infty$, $\max \emptyset = -\infty$. A maximum gap of a set of integers $X = \{a_1, a_2, \ldots, a_k\}$, where $a_1 < a_2 < \cdots < a_k$, is defined as:

$$\mathit{maxgap}(X) = \max\{a_{i+1} - a_i : 1 \le i < k\}, \quad \text{or } 0 \text{ if } |X| \le 1.$$

For simplicity we abuse the notation and write $\mathit{maxgap}(v,\gamma)$ instead of $\mathit{maxgap}(\mathit{Occ}(v,\gamma))$. Now we are ready for the main definition.

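Definition 2. Let $v$ be a subword of $w$, and $\gamma = [i..j]$ be an interval. Denote $\ell_1 = \mathit{first}(v,\gamma) = \min \mathit{Occ}(v,\gamma)$, $\ell_2 = \mathit{last}(v,\gamma) = \max \mathit{Occ}(v,\gamma)$. Let:

$$M = \max\left( \mathit{maxgap}(v,\gamma),\; \ell_1 - i + 1,\; \left\lceil \frac{j - \ell_2}{2} \right\rceil + 1,\; |\mathit{parent}(v,\gamma)| + 1 \right). \qquad (2)$$

If $M \le |v|$ then we define $\mathit{quasigap}(v,\gamma) = M$, otherwise $\mathit{quasigap}(v,\gamma) = \infty$.
If $\mathit{quasigap}(v,\gamma) \ne \infty$ then we call $v$ a quasiseed in $\gamma$.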
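As a sanity check, formula (2) can be evaluated on the data of Example 1. The sketch below is hedged in two ways: the function names and argument order are assumptions of this example, and the third term of the maximum is taken to be $\lceil (j - \ell_2)/2 \rceil + 1$, which is what the quasiseed condition $|z| < |v|$ yields for the last occurrence. The occurrence set and the length of the parent node are passed in directly instead of being read off a suffix tree.

```python
import math


def maxgap(positions):
    """Largest difference between consecutive elements, or 0 if fewer than two."""
    xs = sorted(positions)
    return max((b - a for a, b in zip(xs, xs[1:])), default=0)


def quasigap_in_interval(occ, i, j, v_len, parent_len):
    """Evaluate Definition 2 for a node v in the interval [i..j].

    occ:        start positions of v lying in [i..j]
    v_len:      |v|
    parent_len: |parent(v, gamma)|
    """
    if not occ:
        return math.inf  # min/max of the empty set make M infinite
    l1, l2 = min(occ), max(occ)
    half = -(-(j - l2) // 2)  # ceiling of (j - l2) / 2
    M = max(maxgap(occ), l1 - i + 1, half + 1, parent_len + 1)
    return M if M <= v_len else math.inf
```

For $v$ = aaabaaa in $w$ = aaaaaabaaabaaabaaaa the occurrences are 4, 8, 12 and the parent node has length 3, so the four terms are 4, 4, 5 and 4, giving the value 5 stated in Example 1.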

A careful but simple analysis of this definition lets us observe that quasiseeds in $\gamma$ are also quasiseeds of $w[i..j]$. Hence, the generalized versions of the Quasiseeds-Factors Lemma and the Work-Skipping Lemma hold for arbitrary intervals $\gamma$. What is more, quasiseeds of $w$ are also quasiseeds in $[1..n]$. For arbitrary $\gamma$ a similar statement does not hold, roughly because $\mathit{quasigap}(v,\gamma)$ may not be equal to $\mathit{quasigap}(v)$ restricted to the word $w[i..j]$, since the set $\mathit{Occ}(v,\gamma)$ may also depend on at most $|v|$ letters following $\gamma$.

Nevertheless $\mathit{quasigap}(v,[1..n]) = \mathit{quasigap}(v)$ and for an arbitrary interval the quasiseeds in $\gamma$ lying on a single edge of $T(\gamma)$ can still be described by a quasigap of the explicit deeper end of this edge. More precisely, for $u = \mathit{desc}(v,\gamma)$, we have:

$$\mathit{quasigap}(v,\gamma) = \begin{cases} \mathit{quasigap}(u,\gamma) & \text{if } |v| \ge \mathit{quasigap}(u,\gamma), \\ \infty & \text{otherwise.} \end{cases} \qquad (3)$$

5 Main algorithm 

In this section we provide a recursive algorithm that given an interval $\gamma \subseteq [1..n]$, $|\gamma| = N$, and a tree $T(\gamma)$ computes quasigaps of all explicit nodes of $T(\gamma)$; we call such nodes $\gamma$-relevant.

First, we compute an $f$-factorization $F$ of the word $w[i..j]$. Assume that $F$ contains $g$ middle factors. Let us fix $m = \left\lfloor \frac{c_2 N}{g} \right\rfloor$ and $\Delta = \lfloor c_1 N \rfloor$. We divide the values of finite quasigaps of $\gamma$-relevant nodes into values exceeding $m$ (large quasigaps) and the remaining ones, not exceeding $m$ (small quasigaps). Note that if $m = 0$ then all quasigaps of $\gamma$-relevant nodes are considered large.

Due to the Work-Skipping Lemma, the large quasigaps can be divided into two ranges:

$$\left[m, \frac{2N}{g}\right] \cup [\Delta, N] \subseteq [m, 50m] \cup [\Delta, 50\Delta].$$

Thus all large quasigaps can be computed using the algorithm ComputeInRange($l$, $r$, $\gamma$), described in Section 8. This algorithm computes all quasigaps of $\gamma$-relevant nodes which are in the range $[l, r]$ in $O(N(1 + \log \frac{r}{l}))$ time. We apply it for $r = 50l$, obtaining $O(N)$ time complexity.

If $m = 0$ then there are no small quasigaps. Otherwise the small quasigaps are computed recursively as follows. We start by introducing an $m$-staircase $S$ of intervals covering $\gamma$ and reduce the staircase $S$ with respect to $F$ to obtain a family $\mathcal{K}$, see also Fig. 2. Recall that, by the Reduction-Property Lemma, $\|\mathcal{K}\| \le N/2$. We recursively compute the quasigaps for all intervals $\lambda \in \mathcal{K}$. However, before that, we need to create the trees $T(\lambda)$ for $\lambda \in \mathcal{K}$, which can be done in $O(N)$ total time using the procedure ExtractSubtrees($\gamma$, $\mathcal{K}$), described in Section 6. Finally, the results of the recursive calls are put together to determine the small quasigaps in $T(\gamma)$. This Merge procedure is described in Section 7. Algorithm 1 briefly summarizes the structure of the MAIN algorithm.

Algorithm 1: Recursive procedure MAIN($\gamma$)

Input: An interval $\gamma = [i..j]$ and the tree $T(\gamma)$ induced by suffixes starting in $\gamma$.
Output: $\mathit{quasigap}(v,\gamma)$ for $\gamma$-relevant nodes of $T(\gamma)$.

1 if $|\gamma| \le n_0$ then apply a brute-force algorithm for quasigaps in $T(\gamma)$; return

2 $F := f$-factorization($w[i..j]$)

3 $\Delta := \lfloor |\gamma|/50 \rfloor$; $g :=$ the number of factors of $F$ contained in $[i+2\Delta .. j-\Delta]$

4 $m := \lfloor |\gamma|/(50g) \rfloor$, or $m := 0$ if $g = 0$

5 if $m > 0$ then ComputeInRange($m$, $50m$, $\gamma$) { large quasigaps }

6 ComputeInRange($\Delta$, $50\Delta$, $\gamma$) { large quasigaps }

7 if $m = 0$ then return { no small quasigaps, hence no recursive calls }

8 $S :=$ IntervalStaircase($\gamma$, $m$)

9 $\mathcal{K} :=$ Reduce($S$, $F$) { the total length of intervals in $\mathcal{K}$ is $\le |\gamma|/2$ }

10 ExtractSubtrees($\gamma$, $\mathcal{K}$) { prepare the trees $T(\lambda)$ for $\lambda \in \mathcal{K}$ }

11 foreach $\lambda \in \mathcal{K}$ do MAIN($\lambda$)

12 Merge($\gamma$, $\mathcal{K}$) { merge the results of the recursive calls }



Theorem 2. The algorithm MAIN works in $O(N)$ time, where $N = |\gamma|$.

Proof. All computations performed in a single recursive call of MAIN work in $O(N)$ time. These are: computing the $f$-factorization in line 2 (see [11, 13]), computing large quasigaps in lines 5-6 using the function ComputeInRange (see Section 8), computing the family of working intervals $\mathcal{K}$ and the trees $T(\lambda)$ for $\lambda \in \mathcal{K}$ in lines 8-10 (see Lemma 7 in Section 6) and merging the results of the recursive calls in line 12 of the pseudocode (see Lemma 10 in Section 7). We perform recursive calls for all $\lambda \in \mathcal{K}$; the total length of the intervals from $\mathcal{K}$ is at most $N/2$ (by the Reduction-Property Lemma). We obtain the following recursive formula for the time complexity, where $M = |\mathcal{K}|$ and $C$ is a constant:

$$\mathit{Time}(N) \le C \cdot N + \sum_{i=1}^{M} \mathit{Time}(N_i), \quad \text{where } N_i > 0,\ \sum_{i=1}^{M} N_i \le \frac{N}{2}.$$

This formula easily implies that $\mathit{Time}(N) \le 2C \cdot N$. $\square$
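The last implication can be spelled out by induction on $N$. The derivation below is a sketch which assumes that $C$ is chosen large enough to also cover the brute-force base case $N \le n_0$; since $\sum_i N_i \le N/2$, every $N_i$ is strictly smaller than $N$, so the inductive hypothesis applies to each recursive call.

```latex
\begin{align*}
\mathit{Time}(N)
  &\le C \cdot N + \sum_{i=1}^{M} \mathit{Time}(N_i) \\
  &\le C \cdot N + \sum_{i=1}^{M} 2C \cdot N_i
     && \text{(inductive hypothesis, } N_i < N\text{)} \\
  &=   C \cdot N + 2C \sum_{i=1}^{M} N_i \\
  &\le C \cdot N + 2C \cdot \frac{N}{2}
   =   2C \cdot N .
\end{align*}
```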

In the following three sections we fill in the description of the algorithm by showing how to implement the functions ExtractSubtrees, Merge and ComputeInRange efficiently.






6 Implementation of ExtractSubtrees

In this section we show how to extract all subtrees $T(\lambda)$ from the tree $T(\gamma)$, where $\lambda \in \mathcal{K}$.

Let $\lambda \subseteq \gamma$. Assume that all integer elements from the interval $\lambda$ are sorted in the order of the corresponding leaves of $T(\gamma)$ in a left-to-right traversal: $(i_1, \ldots, i_M)$. Then the tree $T(\lambda)$ can be extracted from $T(\gamma)$ in $O(|\lambda|)$ time using an approach analogous to the construction of a suffix tree from the suffix array [10, 12], see also Fig. 4. In the algorithm we maintain the rightmost path $P$ of $T(\lambda)$, starting with a single edge from the leaf corresponding to $i_1$ to the root of $T(\gamma)$. For each $i_j$, $j = 2, \ldots, M$, we find the (implicit or explicit) node $u$ of $P$ of depth equal to the depth of the lowest common ancestor (LCA) of the leaves $i_{j-1}$ and $i_j$ in $T(\gamma)$, traversing $P$ in a bottom-up manner. The node $u$ is then made explicit, it is connected to the leaf $i_j$ and the rightmost path $P$ is thus altered. Recall that lowest common ancestor queries in $T(\gamma)$ can be answered in $O(1)$ time with $O(N)$ preprocessing time [18].
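The rightmost-path technique can be sketched on an abstract rooted tree. In the Python sketch below the function name `induced_tree` and the array representation (`parent`, `depth`) are assumptions of this example, and the LCA is found by naive climbing instead of the constant-time queries used in the paper, so the sketch illustrates the structure of the construction but not its linear running time.

```python
def induced_tree(parent, depth, root, leaves):
    """Compacted tree induced by the given leaves, as a child -> parent map.

    parent[x]: parent of node x in the original tree (unused for the root)
    depth[x]:  depth of node x, strictly increasing along root-to-leaf paths
    leaves:    selected leaves in left-to-right order (non-empty)
    """
    def lca(a, b):
        # naive LCA: repeatedly lift the deeper of the two nodes
        while a != b:
            if depth[a] < depth[b]:
                a, b = b, a
            a = parent[a]
        return a

    new_parent = {}
    stack = [root, leaves[0]]  # current rightmost path, deepest node on top
    for leaf in leaves[1:]:
        anc = lca(stack[-1], leaf)
        # close the part of the rightmost path lying at or below anc
        while len(stack) >= 2 and depth[stack[-2]] >= depth[anc]:
            new_parent[stack[-1]] = stack[-2]
            stack.pop()
        if stack[-1] != anc:
            # anc was implicit on the last edge: make it explicit
            new_parent[stack[-1]] = anc
            stack[-1] = anc
        stack.append(leaf)
    while len(stack) >= 2:
        new_parent[stack[-1]] = stack[-2]
        stack.pop()
    return new_parent
```

Nodes of the original tree that end up with a single child in the induced tree never enter the stack, which corresponds to them becoming implicit.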




Figure 4: ExtractSubtrees($\gamma$, $\mathcal{K}$) for $\gamma = [3..11]$ and $\mathcal{K} = \{[5..7], [7..9], [9..11]\}$.

Thus, to obtain an $O(N)$ time algorithm computing all the trees $T(\lambda)$, it suffices to sort all the elements of each interval $\lambda \in \mathcal{K}$ in the left-to-right order of leaves of $T(\gamma)$. This can be done, using bucket sort, for all the elements of $\mathcal{K}$ in $O(|\gamma| + \|\mathcal{K}\|) = O(N)$ total time. Thus we obtain the following result.

Lemma 7. ExtractSubtrees($\gamma$, $\mathcal{K}$) can be implemented in $O(|\gamma| + \|\mathcal{K}\|)$ time.
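The bucketing step behind Lemma 7 can be sketched as follows: one pass over the leaves of $T(\gamma)$ in left-to-right order distributes every leaf to all intervals containing it. The function name `leaves_per_interval` and the dictionary-based buckets are assumptions of this Python sketch; the leaf order used in the usage example is made up for illustration.

```python
def leaves_per_interval(leaf_order, intervals):
    """For every interval, list its positions in the left-to-right leaf order.

    leaf_order: leaf labels of the big tree, read left to right
    intervals:  list of (i, j), inclusive bounds
    Runs in time proportional to len(leaf_order) plus the total interval length.
    """
    containing = {}  # position -> indices of the intervals containing it
    for idx, (i, j) in enumerate(intervals):
        for pos in range(i, j + 1):
            containing.setdefault(pos, []).append(idx)
    result = [[] for _ in intervals]
    for leaf in leaf_order:
        for idx in containing.get(leaf, ()):
            result[idx].append(leaf)
    return result
```

With the intervals of Fig. 4 and a hypothetical leaf order 4, 9, 6, 11, 3, 7, 10, 5, 8, the interval [5..7] receives 6, 7, 5, the interval [7..9] receives 9, 7, 8 and the interval [9..11] receives 9, 11, 10.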

7 Implementation of Merge 

In this section we describe how to assemble results of the recursive calls of MAIN to determine the small quasigaps. Speaking more formally, we provide an algorithm that given an interval $\gamma$, a positive integer $m$, a reduced $m$-staircase $\mathcal{K}$ of intervals covering $\gamma$ and quasigaps of explicit nodes in $T(\lambda)$ for all $\lambda \in \mathcal{K}$, computes those quasigaps of explicit nodes in $T(\gamma)$ that are not larger than $m$. More precisely, for every explicit node $v$ in $T(\gamma)$, the algorithm either finds the exact value of $\mathit{quasigap}(v,\gamma)$ or says that it is larger than $m$. In the algorithm we use the following crucial lemma which provides a way of computing $\mathit{quasigap}(v,\gamma)$ using the quasigaps of the nodes from $T(\lambda)$ for all $\lambda \in \mathcal{K}$, provided that $\mathit{quasigap}(v,\gamma) \le m$. Its rather long and complicated proof can be found in the last section of the paper.

Lemma 8. Let $v$ be an explicit node in $T(\gamma)$.

(a) If $v$ is not an implicit or explicit node in $T(\lambda)$ for some $\lambda \in \mathcal{K}$ then $\mathit{quasigap}(v,\gamma) > m$.

(b) If $|\mathit{parent}(v,\gamma)| \ge m$ then $\mathit{quasigap}(v,\gamma) > m$.

(c) If the conditions from (a) and (b) do not hold, let

$$M'(v) = \max\{\mathit{quasigap}(\mathit{desc}(v,\lambda),\lambda) : \lambda \in \mathcal{K}\}.$$

If $M'(v) \le \min(m, |v|)$ then $\mathit{quasigap}(v,\gamma) = M'(v)$, otherwise $\mathit{quasigap}(v,\gamma) > m$.

To obtain an efficient implementation of the criterion from Lemma 8 we utilize the following auxiliary problem. Note that the "max" part of this problem is a generalization of the famous Skyline Problem for trees.

Problem 1 (Tree-Path-Problem). Let $T$ be a rooted tree with $q$ nodes. By $P_{v,u}$ we denote the path from $v$ to its ancestor $u$, excluding the node $u$. Let $\mathcal{P}$ be a family of paths of the form $P_{v,u}$, each represented as a pair $(v,u)$. To each $P \in \mathcal{P}$ a weight $w(P)$ is assigned; we assume that $w(P)$ is an integer of size polynomial in $n$. For each node $v$ of $T$ compute $\max\{w(P) : v \in P\}$ and $\sum_{P : v \in P} w(P)$.

Lemma 9. The Tree-Path-Problem can be solved in $O(q + |\mathcal{P}|)$ time.

Proof. Consider first the "max" part of the Tree-Path-Problem, in which we are to compute, for each $v \in \mathrm{Nodes}(T)$, the maximum of the $w$-values for all paths $P \in \mathcal{P}$ containing $v$ (denote this value as $W_{\max}(v)$). We will show a reduction of this problem to a restricted version of the find/union problem, in which the structure of the union operations forms a static tree. This problem can be solved in linear time [15].

We will be processing all elements of $\mathcal{P}$ in the order of non-increasing values of $w$. We store a partition of the set $\mathrm{Nodes}(T)$ into disjoint sets, where each set is represented by its topmost node in $T$ (called the root of the set). Initially each node forms a singleton set. Throughout the algorithm we will be assigning $W_{\max}$ values to the nodes of $T$, maintaining an invariant that nodes without an assigned value form single-element sets.

When processing a path $P_{v,u} \in \mathcal{P}$, we identify all the sets $S_1, S_2, \ldots, S_k$ which intersect the path. For this, we start in the node $v$, go to the root of the set containing $v$, proceed to its parent, and so on, until we reach the set containing the node $u$. For all singleton sets among $S_i$, we set the value $W_{\max}$ of the corresponding node to $w(P_{v,u})$, provided that this node was not assigned the $W_{\max}$ value yet. Finally, we union all the sets $S_i$.

Note that the structure of the union operations is determined by the child-parent relations in the tree $T$, which is known in advance. Thus all the find/union operations can be performed in linear time [15], which yields $O(q + |\mathcal{P}|)$ time in total.
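The "max" part can be illustrated with a small sketch. It is a simplified variant, not the exact procedure above: instead of the linear-time static-tree find/union structure of [15] it uses an ordinary path-compressing jump pointer (so it is only near-linear), and it relies on a comparison sort. The function name `tree_path_max`, the parent-array encoding of $T$ and the `(v, u, w)` triples are assumptions made for this illustration.

```python
def tree_path_max(parent, paths):
    """parent[x] is the parent of node x, or -1 for the root.

    paths is a list of triples (v, u, w): the path from v up to its proper
    ancestor u, excluding u, with weight w.  Returns the list W_max, with
    None for nodes that lie on no path.
    """
    q = len(parent)

    # Depth of every node, computed iteratively.
    depth = [None] * q
    for x in range(q):
        chain, y = [], x
        while y != -1 and depth[y] is None:
            chain.append(y)
            y = parent[y]
        base = -1 if y == -1 else depth[y]
        for z in reversed(chain):
            base += 1
            depth[z] = base

    # jump[x] leads towards the lowest node at or above x that has no value
    # yet; index q is a sentinel placed above the root.
    jump = list(range(q + 1))

    def find(x):
        top = x
        while jump[top] != top:
            top = jump[top]
        while jump[x] != top:            # path compression
            jump[x], x = top, jump[x]
        return top

    w_max = [None] * q
    # Non-increasing weights: the first value assigned to a node is the maximum.
    for v, u, w in sorted(paths, key=lambda p: -p[2]):
        x = find(v)
        while x != q and depth[x] > depth[u]:
            w_max[x] = w
            jump[x] = parent[x] if parent[x] != -1 else q
            x = find(x)
    return w_max
```

Each node receives a value at most once, which is the property that the invariant on singleton sets expresses in the proof.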

Now we proceed to an implementation of the "+" part of the Tree-Path-Problem (we compute the values $W_+(v)$). This time, instead of considering a path $P_{v,u}$ with the value $w(P_{v,u})$, we consider a path $P_{v,\mathrm{root}}$ with the value $w(P_{v,u})$ and a path $P_{u,\mathrm{root}}$ with the value $-w(P_{v,u})$. Now each considered path leads from a node of $T$ to the root of $T$.

For each $u \in \mathrm{Nodes}(T)$ we store, as $W'(u)$, the sum of weights of all such paths starting in $u$. Note that $W_+(v)$ equals the sum of $W'(u)$ for all $u$ in the subtree rooted at $v$. Hence, all $W_+$ values can be computed in $O(q)$ time by a simple bottom-up traversal of $T$. □
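The "+" part amounts to a difference trick followed by subtree sums. A minimal sketch follows, assuming the same hypothetical encoding as before (a parent array with -1 for the root and paths given as `(v, u, w)` triples); the name `tree_path_sum` is illustrative.

```python
def tree_path_sum(parent, paths):
    """Return, for every node, the sum of weights of the paths containing it."""
    q = len(parent)

    # W'(x): +w at the bottom end v of a path, -w at its excluded top end u.
    delta = [0] * q
    for v, u, w in paths:
        delta[v] += w
        delta[u] -= w

    # List the nodes so that every node appears after its parent.
    children = [[] for _ in range(q)]
    order = []
    for x, p in enumerate(parent):
        if p == -1:
            order.append(x)
        else:
            children[p].append(x)
    for x in order:                  # the list grows while we scan it (BFS)
        order.extend(children[x])

    # Bottom-up accumulation: W_+(v) is the sum of W' over the subtree of v.
    w_sum = delta[:]
    for x in reversed(order):
        if parent[x] != -1:
            w_sum[parent[x]] += w_sum[x]
    return w_sum
```

On the five-node tree `parent = [-1, 0, 1, 1, 0]` with paths `(2, 0, 5)`, `(3, 1, 7)`, `(2, 1, 9)` and `(4, 0, 1)`, the root obtains 0, since the positive and negative contributions cancel there.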

Now let us explain how the implementation of Merge can be reduced to the Tree-Path-Problem. In our case $T = T(\gamma)$. Observe that for each $\lambda \in \mathcal{K}$ an edge from the node $u$ down to the node $v$ in $T(\lambda)$ induces a path $P_{v,u}$ in $T(\gamma)$. Let $\mathcal{P}$ be the family of all paths induced by such edges. If we set the weight of each path $P_{v,u}$ corresponding to an edge $(u,v)$ in $T(\lambda)$ to 1, we can identify all nodes $v'$ which are either explicit or implicit in each $T(\lambda)$ for $\lambda \in \mathcal{K}$ as exactly those nodes for which the sum of the corresponding $w(P)$ values equals $|\mathcal{K}|$. On the other hand, if we set the weight of such a path to $\mathrm{quasigap}(v,\lambda)$, we can compute for each $v'$ explicit in $T(\gamma)$ the maximum of $\mathrm{quasigap}(\mathrm{desc}(v',\lambda),\lambda)$ over such $\lambda$ that $v'$ is an explicit or implicit node in $T(\lambda)$. In particular, for nodes identified in the previous part, this maximum equals $M'(v')$. All the remaining conditions from Lemma 8 can trivially be checked in time proportional to the size of $T(\gamma)$, that is, in $O(|\gamma|)$ time. Note that $|\mathcal{P}| = O(\|\mathcal{K}\|)$ and all weights are from the set $[0\,..\,|\gamma|] \cup \{\infty\}$, and therefore can be treated as bounded integers. Thus, as a consequence of Lemma 9 we obtain the following corollary.

Lemma 10. Merge$(\gamma,\mathcal{K})$ can be implemented in $O(|\gamma| + \|\mathcal{K}\|)$ time.

8 Implementation of ComputeInRange

In this section we show how to compute quasigaps in a range $[l, r]$ for $\gamma$-relevant nodes of $T(\gamma)$. More precisely, for each $v \in \mathrm{Nodes}(T(\gamma))$ we either compute the exact value of $\mathrm{quasigap}(v,\gamma)$, report that $\mathrm{quasigap}(v,\gamma) < l$, or report that $\mathrm{quasigap}(v,\gamma) > r$; we call such values $[l,r]$-restricted quasigaps. Note that if we made the range larger, we would still solve the initial problem, hence we may w.l.o.g. assume that $r/l$ is a power of 2. This lets us split the range $[l,r]$ into several ranges of the form $[d, 2d]$. Below we give an algorithm for such intervals running in $O(N)$ time (where $N = |\gamma|$), which for an arbitrary range gives $O(N(1 + \log\frac{r}{l}))$ time complexity.
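The splitting step can be sketched as follows. The helper name `dyadic_ranges` is hypothetical, and the sketch assumes integers $1 \le l \le r$; when $r/l$ is not a power of 2 the last range simply overshoots $r$, which corresponds to enlarging the range.

```python
def dyadic_ranges(l, r):
    """Cover [l, r] by ranges [d, 2d] with d = l, 2l, 4l, ..."""
    ranges = []
    d = l
    while True:
        ranges.append((d, 2 * d))
        if 2 * d >= r:
            return ranges
        d *= 2
```

The number of produced ranges is $O(1 + \log\frac{r}{l})$, so running a linear-time procedure once per range gives the stated bound.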

For many nodes $v$ we can easily check that $\mathrm{quasigap}(v,\gamma) > 2d$. Thus, we may limit ourselves to the nodes which we call plausible, that is, such $v \in \mathrm{Nodes}(T(\gamma))$ that

$|\mathrm{Occ}(v,\gamma)| \ge \frac{N}{2d} \;\wedge\; \mathrm{first}(v,\gamma) \le i + 2d \;\wedge\; \mathrm{last}(v,\gamma) \ge j - 4d + 1.$ (4)

If $v$ is not a plausible node then certainly $\mathrm{quasigap}(v,\gamma) > 2d$.

Observe that if $v$ is plausible then $\mathrm{parent}(v,\gamma)$ is also plausible. Hence, plausible nodes induce a subtree of $T(\gamma)$. Let us call the branching nodes of this tree active. More precisely, the active nodes are: the root of $T(\gamma)$ and the plausible nodes which have either none or more than one plausible child node. Non-active plausible nodes are called regular plausible nodes. Obviously, all active and plausible nodes of $T(\gamma)$ can be found in $O(N)$ time. The following lemma shows that the number of active nodes is very limited.

Lemma 11. The tree $T(\gamma)$ contains $O(d)$ active nodes.

Proof. Let us divide the set of all active nodes of $T(\gamma)$ into the active nodes having plausible children (the set $X$) and all the remaining active nodes (the set $Y$). Apart from at most one node (the root of $T(\gamma)$), each node from $X$ has at least two child subtrees containing active nodes, hence $|X| \le |Y|$. The size of the subtree of $T(\gamma)$ rooted at any node from the set $Y$ is $\Omega(N/d)$ and all such subtrees are pairwise disjoint. Hence, $|Y| = O(d)$ and $|X| + |Y| = O(d)$. □

Quasigaps of plausible nodes will be computed directly from (2). Note that the only computationally difficult part of this equation is the maxgap. However, the $[d, 2d]$-restricted maxgaps for all plausible nodes can still be found in linear time. Precisely, for a plausible node $v$, we either find the exact $\mathrm{maxgap}(v,\gamma)$ or report that $\mathrm{maxgap}(v,\gamma) < d$ or that $\mathrm{maxgap}(v,\gamma) > 2d$.

Figure 5: A sample tree with active nodes marked in black and regular plausible nodes marked in grey.

The main idea of the algorithm computing restricted maxgaps is to have a bucket for each $d$ consecutive elements of $\gamma$ (with the last bucket possibly smaller). Note that, since the gaps between elements of the same bucket are certainly smaller than $d$, the $[d, 2d]$-restricted $\mathrm{maxgap}(v,\gamma)$ depends only on the first and the last element of each bucket. Due to the small number of active nodes, this observation on its own lets us compute $[d, 2d]$-restricted maxgaps of all active nodes in $O(N)$ time. The computation for regular plausible nodes requires more attention. We use the fact that all such nodes can be divided into $O(d)$ disjoint paths connecting pairs of active nodes to develop an algorithm for more efficient bucket processing.
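The bucket observation can be made concrete with a small sketch for a single set of occurrences. It assumes that maxgap denotes the largest difference between consecutive occurrences (the definition is given earlier in the paper and is not restated here); the function name `restricted_maxgap` and the string labels are illustrative choices.

```python
def restricted_maxgap(occurrences, i, N, d):
    """[d, 2d]-restricted maxgap of positions taken from [i, i + N - 1].

    Returns the exact maxgap if it lies in [d, 2d], otherwise one of the
    labels "<d" or ">2d".
    """
    # One bucket per d consecutive positions; only (smallest, largest) is kept.
    buckets = [None] * (N // d + 1)
    for pos in occurrences:
        k = (pos - i) // d
        if buckets[k] is None:
            buckets[k] = [pos, pos]
        else:
            buckets[k][0] = min(buckets[k][0], pos)
            buckets[k][1] = max(buckets[k][1], pos)

    # Gaps inside a bucket are smaller than d, so only the gaps between
    # consecutive non-empty buckets matter for the restricted value.
    gap, prev_last = 0, None
    for b in buckets:
        if b is None:
            continue
        if prev_last is not None:
            gap = max(gap, b[0] - prev_last)
        prev_last = b[1]

    if gap < d:
        return "<d"
    if gap > 2 * d:
        return ">2d"
    return gap
```

A single call takes time proportional to the number of occurrences plus the number of buckets, which is why repeating it for $O(d)$ active nodes with $O(N/d)$ buckets each stays linear once the bucket contents are reused.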

For each plausible node $v$ we define the list $L1(v)$ which consists of all $l \in \mathrm{Occ}(v,\gamma)$ such that the path from $v$ to $l$ contains only one plausible node (namely $v$). For each active node $v$ we define the list $L2(v)$ which consists of all $l \in \mathrm{Occ}(v,\gamma)$ such that the path from $v$ to $l$ contains only one active node (again, the node $v$). A sample tree with active and plausible nodes along with both kinds of lists marked is presented in Fig. 5. Note that each leaf can be present in at most one list $L1$, and at most one list $L2$, and additionally all the lists $L1$ and $L2$ can be constructed in $O(N)$ total time by a simple bottom-up traversal of $T(\gamma)$. For each active node $v$ we introduce the set active-desc$(v)$ (immediate active descendants) which consists of all active descendants $u$ of $v$ such that the path from $v$ to $u$ contains no active nodes apart from $v$ and $u$ themselves.

Algorithm 2: Computing restricted maxgap values for active nodes

Input: A suffix tree $T(\gamma)$ and an integer value $d$.
Output: The suffix tree with active nodes annotated with maxgap values; for some nodes we can use labels "< d" or "> 2d".

1 for $v \in$ active-nodes$(T(\gamma))$ (from the bottom to the top) do
2     Initialize $B_v$ with empty doubly-linked lists
3     foreach $u \in$ active-desc$(v)$ do UpdateBuckets$(B_v, B_u)$
4     UpdateBuckets$(B_v, L2(v))$
5     Replace each $B_v[k]$ of size at least 3 with $\{head(B_v[k]), tail(B_v[k])\}$
6     Compute maxgap$(v)$ using the contents of $B_v$; use labels "< d" or "> 2d" if the result is outside the range $[d, 2d]$



The algorithm first computes the $[d, 2d]$-restricted maxgaps of all active nodes, and later on for all the regular plausible nodes. For each active node $v$, Algorithm 2 computes an array $B_v[0\,..\,\lceil N/d \rceil]$, such that $B_v[k]$ contains the minimal and the maximal elements in the set $\mathrm{Occ}(v, [i + d \cdot k \,..\, i + d \cdot (k+1) - 1])$, provided that this set is not empty. To fill the $B_v$ array, the algorithm inspects all elements of $L2(v)$ and the arrays $B_u$ for all $u \in$ active-desc$(v)$. For this, an auxiliary procedure UpdateBuckets$(B, L)$ is utilized, in which, while being updated, each bucket $B[k]$ always contains an increasing sequence of elements. Afterwards, the $[d, 2d]$-restricted maxgap of $v$ can be computed by a single traversal of the $B_v$ array.



Algorithm 3: UpdateBuckets$(B, L)$ procedure

Input: An array of buckets $B$ and a list of occurrences $L$.
Output: The buckets from $B$ updated with the positions from $L$.

1 foreach $l \in L$ do
2     Let $k$ be the bucket assigned to the position $l$
3     if $empty(B[k])$ then $B[k] := \{l\}$
4     else if $l < head(B[k])$ then add $l$ to the front of $B[k]$
5     else if $l > tail(B[k])$ then add $l$ to the back of $B[k]$
6     else we can ignore $l$
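A direct transcription of UpdateBuckets into Python may help to see why each bucket stays an increasing sequence. This is only a sketch: it stores buckets as `collections.deque` objects in place of doubly-linked lists, and the parameters `i` and `d` (start of the interval and bucket width) are passed explicitly, which the pseudocode leaves implicit.

```python
from collections import deque


def update_buckets(buckets, positions, i, d):
    """Insert positions into buckets; bucket k covers i + d*k .. i + d*(k+1) - 1."""
    for l in positions:
        k = (l - i) // d
        bucket = buckets[k]
        if not bucket:
            bucket.append(l)
        elif l < bucket[0]:
            bucket.appendleft(l)     # new head
        elif l > bucket[-1]:
            bucket.append(l)         # new tail
        # otherwise l lies between head and tail and can be ignored
```

For example, starting from three empty buckets with `i = 1` and `d = 4`, inserting `[5, 2, 9, 3, 1, 10]` leaves the buckets `[1, 2, 3]`, `[5]` and `[9, 10]`.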



Claim 1. Algorithm 2 has time complexity $O(N)$.

Proof. The total size of all the arrays $B_v$ is $O(N)$, since there are $O(d)$ active nodes and each such array has $O(N/d)$ elements, each of constant size. We only need to investigate the total time of all UpdateBuckets$(B_v, L)$ calls (note that a single call works in $O(|L|)$ time). Recall that the total length of all the lists $L2$ is $O(N)$, therefore the calls in line 4 of the algorithm take $O(N)$ total time. Similarly, each array $B_u$ is used at most once in a call of UpdateBuckets$(B_v, B_u)$ from line 3, so this step also takes $O(N)$ time in total. □

Now we proceed to the computation of $[d, 2d]$-restricted maxgaps for regular plausible nodes. We process them in maximal paths $p_1, \ldots, p_a$, such that $p_0 = v$ is an active node and $p_k = \mathrm{parent}(p_{k-1})$. Here we also store an array of buckets $B$, but this time each $B[k]$ may contain more than 2 elements (stored in a doubly-linked list). Starting from $B_v$, we update $B$ using the lists $L1(p_1), \ldots, L1(p_a)$. The $[d, 2d]$-restricted maxgap for $p_a$ is computed, by definition, from the array $B$, but the computations for $p_{a-1}, \ldots, p_1$ require more attention. This time we remove elements of the lists $L1(p_a), \ldots, L1(p_1)$ from the buckets $B$. If we remove an occurrence $l$ corresponding to bucket $B[q]$, the $[d, 2d]$-restricted maxgap value can be altered only if $l$ is the head or the tail of the list $B[q]$. In this case, if $l$ is neither the globally first nor the globally last occurrence, the $[d, 2d]$-restricted maxgap can only increase, and we can update it in constant time by investigating a few neighbouring buckets. Otherwise we recompute the $[d, 2d]$-restricted maxgap using all buckets. Note that due to the restriction on first and last in the definition of plausible nodes (4), the latter case may happen only if we delete something from one of the first few or the last few buckets; this case is handled in line 15 of Algorithm 4.

Claim 2. Algorithm 4 has time complexity $O(N)$.

Proof. Since the total length of the lists $L1$ is $O(N)$, updating the buckets takes $O(N)$ total time. Hence, it suffices to show that the total time of computing maxgaps is $O(N)$. First, computation from scratch, taking $O(N/d)$ time, is performed $O(d)$ times, since there are only $O(d)$ active nodes (line 7) and also $O(d)$ occurrences close to the beginning or the end of $\gamma$ (line 15). Finally, the recomputation in line 17 is performed $O(N)$ times in total and each such step takes $O(1)$ time (note that it suffices to consider only the peripheral elements of each of the buckets). □

Thus we obtain the following lemma. 

Lemma 12. For the tree $T(\gamma)$ and any constant $d$, the $[d, 2d]$-restricted quasigaps for all nodes of $T(\gamma)$ can be computed in $O(N)$ time.

Algorithm 4: Computing restricted maxgap values for regular plausible nodes

Input: A suffix tree $T(\gamma)$ and an integer value $d$.
Output: The suffix tree with all plausible nodes annotated with $[d, 2d]$-restricted maxgap values.

1 Compute $\mathrm{first}(v,\gamma)$ and $\mathrm{last}(v,\gamma)$ for each node of $T(\gamma)$
2 Compute buckets $B_v$ and maxgap for all active nodes
3 foreach $v \in$ active-nodes$(T(\gamma))$ do
4     Let $p_1, \ldots, p_a$ be the maximal path of regular plausible nodes defined as $p_0 = v$, $p_k = \mathrm{parent}(p_{k-1},\gamma)$
5     $B := B_v$
6     for $k = 1$ to $a$ do UpdateBuckets$(B, L1(p_k))$
7     curr := compute maxgap using the buckets $B$
8     for $k = a$ to $1$ do
9         maxgap$(p_k,\gamma)$ := curr
10        foreach $l \in L1(p_k)$ do
11            let $q$ be the bucket assigned to the occurrence $l$
12            if $l = head(B[q])$ or $l = tail(B[q])$ then
13                remove $l$ from $B[q]$
14                if $l \le i + 2d$ or $l \ge j - 4d + 1$ then
15                    curr := recompute maxgap from the buckets $B$
16                else
17                    curr := recompute maxgap from curr using the contents of $B[q-2\,..\,q+2]$
18 Replace maxgap values outside range $[d, 2d]$ with labels "< d" or "> 2d".



Proof. We compute $[d, 2d]$-restricted maxgap values for the plausible nodes of $T(\gamma)$ using Algorithm 4. Given the maxgap values we can compute $[d, 2d]$-restricted quasigap values using formula (2) (in $O(1)$ time for each node). For non-plausible nodes we set the $[d, 2d]$-restricted quasigap to $\infty$. Claims 1 and 2 imply that the running time is linear. □

9 Proof of Lemma 8

Lemma 8 is a deep consequence of the definitions of quasigap, $f$-factorization, interval staircase and reduction with respect to a factorization. Its proof is rather intricate, hence we conduct it in several steps.

We start with two simple auxiliary claims. The first one concerns monotonicity of quasigap 
with respect to the interval. The other binds the quasigaps in two intervals of equal length such 
that extensions of the corresponding subwords are equal. 

Claim 3. Let $v$ be a subword of $w$, and let $\lambda \subseteq \gamma$ be two intervals. If $\mathrm{quasigap}(v,\lambda)$ is finite, then $\mathrm{quasigap}(v,\lambda) \le \mathrm{quasigap}(v,\gamma)$.

Proof. We prove the claim straight from the definition of quasigaps, i.e., the formula (2).

Let $\lambda = [i'\,..\,j']$ and $\gamma = [i\,..\,j]$. Observe that the set $\mathrm{Occ}(v,\gamma)$ is obtained from $\mathrm{Occ}(v,\lambda)$ by adding several elements less than $i'$ and greater than $j'$. Note that in this process the maxgap cannot decrease, therefore $\mathrm{maxgap}(v,\lambda) \le \mathrm{maxgap}(v,\gamma)$.
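A tiny numeric check illustrates this monotonicity. The sketch assumes that maxgap is the largest difference between consecutive occurrences, taken as 0 when there are fewer than two of them; the paper's own definition appears earlier and may treat such degenerate sets differently.

```python
def maxgap(occurrences):
    """Largest difference between consecutive occurrences (0 if fewer than two)."""
    s = sorted(occurrences)
    return max((b - a for a, b in zip(s, s[1:])), default=0)


# Occurrences inside a hypothetical lambda = [5 .. 10] and inside a larger gamma:
occ_lambda = [5, 8, 9]
occ_gamma = [1, 5, 8, 9, 14]     # extra elements lie outside [5 .. 10]
```

Here `maxgap(occ_lambda)` is 3 and `maxgap(occ_gamma)` is 5: elements added outside the smaller interval only create new gaps at both ends and never split an existing one.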

Note that $T(\lambda)$ is an induced subtree of $T(\gamma)$, so the explicit nodes of $T(\gamma)$ form a superset of the explicit nodes of $T(\lambda)$. Hence, $|\mathrm{parent}(v,\gamma)| \ge |\mathrm{parent}(v,\lambda)|$.

Now we show that

$\mathrm{first}(v,\lambda) - i' + 1 \le \max(\mathrm{maxgap}(v,\gamma),\ \mathrm{first}(v,\gamma) - i + 1);$

note that $\mathrm{first}(v,\lambda) < \infty$, since $\mathrm{quasigap}(v,\lambda)$ is expected to be finite. Consider two cases. If $\mathrm{first}(v,\lambda) = \mathrm{first}(v,\gamma)$ then

$\mathrm{first}(v,\lambda) - i' + 1 \le \mathrm{first}(v,\gamma) - i + 1.$

Otherwise, let $x \in \mathrm{Occ}(v,\gamma)$ be the largest element of this set which is less than $i'$. Then

$\mathrm{first}(v,\lambda) - i' + 1 \le \mathrm{first}(v,\lambda) - x \le \mathrm{maxgap}(v,\gamma).$

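Similarly, one can prove that

$\left\lceil \frac{j' - \mathrm{last}(v,\lambda)}{2} \right\rceil + 1 \le \max\left(\mathrm{maxgap}(v,\gamma),\ \left\lceil \frac{j - \mathrm{last}(v,\gamma)}{2} \right\rceil + 1\right).$ □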



Claim 4. Let $v$ be a subword of $w$. Let $0 \le k < n - |v|$ and $1 \le i, i' \le n - |v| - k$ be such integers that $w[i\,..\,i+k+|v|] = w[i'\,..\,i'+k+|v|]$. Then $\mathrm{quasigap}(v, [i\,..\,i+k]) = \mathrm{quasigap}(v, [i'\,..\,i'+k])$.

Proof. Note that the induced suffix trees $T([i\,..\,i+k])$ and $T([i'\,..\,i'+k])$ are compacted tries of the words $\{w[i\,..\,n]\#, w[i+1\,..\,n]\#, \ldots, w[i+k\,..\,n]\#\}$ and the words $\{w[i'\,..\,n]\#, w[i'+1\,..\,n]\#, \ldots, w[i'+k\,..\,n]\#\}$ respectively. Since $w[i\,..\,i+k+|v|] = w[i'\,..\,i'+k+|v|]$, the top $|v|+1$ levels of $T([i\,..\,i+k])$ and $T([i'\,..\,i'+k])$ are identical. Moreover, $i + x \in \mathrm{Occ}(v, [i\,..\,i+k])$ if and only if $i' + x \in \mathrm{Occ}(v, [i'\,..\,i'+k])$. Hence, $\mathrm{quasigap}(v, [i\,..\,i+k]) = \mathrm{quasigap}(v, [i'\,..\,i'+k])$. □

For the rest of this section let us fix an interval $\gamma$ with an $f$-factorization $F$. Recall the value $m$ defined earlier. Also let $\mathcal{S}$ be an $m$-staircase covering $\gamma$ and $\mathcal{K} = \mathrm{Reduce}(\mathcal{S}, F)$.

Lemma 8, which is the final aim of this section, concerns the reduced staircase. Now we prove a similar result involving the regular $m$-staircase. It characterizes all small quasigaps in $\gamma$ in terms of quasigaps in an interval staircase covering $\gamma$.

Claim 5. Let $v$ be a subword of $w$ and let

$M = \max\{\mathrm{quasigap}(v,\lambda) : \lambda \in \mathcal{S}\}.$

If $M \le m$ then $\mathrm{quasigap}(v,\gamma) = M$, otherwise $\mathrm{quasigap}(v,\gamma) > m$.

First, let us prove the following claim.

Claim 6. Let $[i\,..\,j]$ and $[i'\,..\,j']$ be such intervals that $i \le i' \le j \le j'$, $j - i' \ge 2m - 1$, and

$M = \max(\mathrm{quasigap}(v, [i\,..\,j]),\ \mathrm{quasigap}(v, [i'\,..\,j'])).$

If $M \le m$, then $\mathrm{quasigap}(v, [i\,..\,j']) = M$, otherwise $\mathrm{quasigap}(v, [i\,..\,j']) > m$.

Proof (of Claim 6). Let $O = \mathrm{Occ}(v, [i\,..\,j])$, $O' = \mathrm{Occ}(v, [i'\,..\,j'])$ and $O'' = \mathrm{Occ}(v, [i\,..\,j'])$. Note that if $M \le m$, then $[i'\,..\,j]$ contains an element of $\mathrm{Occ}(v)$, and thus

$\mathrm{maxgap}(O'') = \max(\mathrm{maxgap}(O), \mathrm{maxgap}(O')).$

Otherwise we have that $\mathrm{maxgap}(O'') > m$ and consequently $\mathrm{quasigap}(v, [i\,..\,j']) > m$.

Further assume that $M \le m$. Clearly $\min(O) = \min(O'')$ and $\max(O') = \max(O'')$. Also,

$\min(O') - i' + 1 \le \max(\mathrm{maxgap}(O''),\ \min(O'') - i + 1)$

and

$j - \max(O) \le \max(\mathrm{maxgap}(O'') - 1,\ j' - \max(O'')).$

Let us consider the node in $T([i\,..\,j'])$ corresponding to $v$. Let us observe that $\mathrm{parent}(v, [i\,..\,j'])$ is the lower of the nodes $\mathrm{parent}(v, [i\,..\,j])$, $\mathrm{parent}(v, [i'\,..\,j'])$. Finally, from the formula (2) we obtain that $\mathrm{quasigap}(v, [i\,..\,j']) = M$. □

Proof (of Claim 5). The claim is proved by simple induction over $|\mathcal{S}|$. Since the overlap between each two consecutive intervals in $\mathcal{S}$ is $2m$, the induction step follows from Claim 6. □

Now, we provide a similar result concerning the reduced staircase instead of the regular staircase. Unfortunately, due to the assumptions of Claim 4 binding the length of $v$ and the length of the subwords of $w$ which we require to be equal, we need to make an additional assumption limiting $|v|$.

Claim 7. Let $v$ be such a subword of $w$ that $|v| \le m$. Additionally, let

$M = \max\{\mathrm{quasigap}(v,\lambda) : \lambda \in \mathcal{K}\}.$

If $M \le m$, then $\mathrm{quasigap}(v,\gamma) = M$, otherwise $\mathrm{quasigap}(v,\gamma) > m$.

Proof. By Claim 3, if $\mathrm{quasigap}(v,\gamma) \le m$, then $\mathrm{quasigap}(v,\lambda) \le m$ for all intervals $\lambda \in \mathcal{K}$, consequently $M \le m$. Thus if $M > m$ then $\mathrm{quasigap}(v,\gamma) > m$.

Now, let us assume that $M \le m$. Let the intervals in $\mathcal{S} = \{\beta_1, \ldots, \beta_r\}$ be ordered from left to right, and let us denote $M_p = \max_{h=1,\ldots,p} \mathrm{quasigap}(v,\beta_h)$. From Claim 5 we know that $\mathrm{quasigap}(v,\gamma) = M_r$. To prove the claim, it suffices to show that $M_r = M$.

Let $\beta_h = [i\,..\,j] \in \mathcal{S} \setminus \mathcal{K}$. By the definition of $\mathcal{K}$, $w[i\,..\,j+m]$ occurs in some factor $f_g$ of $F$, and $f_g$ occurs in $w[\beta_1 \cup \ldots \cup \beta_{h-1}]$. Hence, $w[i\,..\,j+m] = w[i'\,..\,i'+j-i+m]$, for some $1 \le i' < 2i - j - m$. By Claims 3 and 4, $\mathrm{quasigap}(v,\beta_h) = \mathrm{quasigap}(v, [i'\,..\,i'+j-i]) \le M_{h-1}$, and $M_h = M_{h-1}$. By simple induction, we obtain $M_r = M$. □

The claim above does not provide any information on quasigaps of words longer than m. This 
gap needs to be filled in, therefore we prove that the assumption limiting the length can be replaced 
by a similar one involving the length of the parent. This is a major improvement since the length 
of the parent is a part of the definition of quasigaps. As a consequence we can finally characterize 
all small quasigaps in a succinct way. 

Claim 8. Let $v$ be a subword of $w$, and let

$M = \max\{\mathrm{quasigap}(v,\lambda) : \lambda \in \mathcal{K}\}.$

If $|\mathrm{parent}(v,\gamma)| < m$ and $M \le m$, then $\mathrm{quasigap}(v,\gamma) = M$, otherwise $\mathrm{quasigap}(v,\gamma) > m$.

Proof. If $|v| \le m$, then the claim is an obvious consequence of Claim 7. So, let us assume that $|v| > m$. If $|\mathrm{parent}(v,\gamma)| \ge m$, then, by (2), $\mathrm{quasigap}(v,\gamma) > m$. Let us assume that $|\mathrm{parent}(v,\gamma)| < m < |v|$, and let $v' = v[1\,..\,m]$. Note that for each $\lambda \in \mathcal{K}$ we have $\mathrm{Occ}(v,\lambda) = \mathrm{Occ}(v',\lambda)$.

If $M \le m$, then, by the definition of quasigaps, we have $\mathrm{quasigap}(v',\lambda) = \mathrm{quasigap}(v,\lambda)$ for each $\lambda \in \mathcal{K}$. By Claim 7, $M = \mathrm{quasigap}(v',\gamma)$, and due to the formula (3), $M = \mathrm{quasigap}(v,\gamma)$.

On the other hand, if $M > m$, then $\mathrm{quasigap}(v',\lambda) = \infty$ for some $\lambda \in \mathcal{K}$. From this we conclude that $\mathrm{quasigap}(v',\gamma) = \infty$ and thus $\mathrm{quasigap}(v,\gamma) > |v'| = m$. □

Both Claim 8 and Lemma 8 provide a characterization of all small quasigaps. The first one is probably simpler and cleaner but it is the latter that enables computing small quasigaps in an efficient and fairly simple manner.

Lemma 8. Let $v$ be an explicit node in $T(\gamma)$.

(a) If $v$ is not an implicit or explicit node in $T(\lambda)$ for some $\lambda \in \mathcal{K}$, then $\mathrm{quasigap}(v,\gamma) > m$.

(b) If $|\mathrm{parent}(v,\gamma)| \ge m$, then $\mathrm{quasigap}(v,\gamma) > m$.

(c) If the conditions from (a) and (b) do not hold, let

$M'(v) = \max\{\mathrm{quasigap}(\mathrm{desc}(v,\lambda),\lambda) : \lambda \in \mathcal{K}\}.$

If $M'(v) \le \min(m, |v|)$ then $\mathrm{quasigap}(v,\gamma) = M'(v)$, otherwise $\mathrm{quasigap}(v,\gamma) > m$.

Proof. Let $M = \max\{\mathrm{quasigap}(v,\lambda) : \lambda \in \mathcal{K}\}$. Recall the formula (3):

$\mathrm{quasigap}(v,\lambda) = \begin{cases} \mathrm{quasigap}(\mathrm{desc}(v,\lambda),\lambda) & \text{if } |v| \ge \mathrm{quasigap}(\mathrm{desc}(v,\lambda),\lambda), \\ \infty & \text{otherwise.} \end{cases}$

If $v$ is neither an explicit nor an implicit node in $T(\lambda)$ for some $\lambda \in \mathcal{K}$ then $\mathrm{quasigap}(v,\lambda) = \infty$ and thus $M = \infty$. Hence (by Claim 8) $\mathrm{quasigap}(v,\gamma) > m$, which concludes part (a) of the lemma. Let us assume otherwise, that $v \in T(\lambda)$ for all $\lambda \in \mathcal{K}$. Then we have:

$M = \begin{cases} M'(v) & \text{if } |v| \ge M'(v), \\ \infty & \text{otherwise.} \end{cases}$

Indeed, if $M'(v) > |v|$ then, by (3), $M = \infty$ and, by Claim 8, $\mathrm{quasigap}(v,\gamma) > m$. Otherwise, if $M'(v) \le |v|$, due to (3) we have $M = M'(v)$. Then the statement of the lemma becomes the same as the statement of Claim 8. □